Key Questions:
Key Answers:
| No. | Name | Meaning | Data Name | Short Data Name | Analytic Type | Data Type | Unit Of Measure | Variable Type | Description / Comments |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Serial Number | Unique Data Record Serial # | Sl_No | sn | Quantitative | Numeric | Integer, Ordinal | Input | Predictor |
| 2 | Customer Key | Unique Customer ID # | Customer Key | cstid | Quantitative | Numeric | Integer, Nominal | Input | Predictor |
| 3 | Average Credit Limit | Cust: Average Credit Limit | Avg_Credit_Limit | crlmt | Quantitative | Numeric | Currency, Cardinal | Input | Predictor |
| 4 | Total Credit Cards | Cust: Total # of Credit Cards | Total_Credit_Cards | crcrds | Quantitative | Numeric | Count, Cardinal | Input | Predictor |
| 5 | Total Visits to Bank | Cust: Total # of Visits to Bank | Total_visits_bank | bvisits | Quantitative | Numeric | Count, Cardinal | Input | Predictor |
| 6 | Total Visits Online | Cust: Total # Of Visits Online | Total_visits_online | evisits | Quantitative | Numeric | Count, Cardinal | Input | Predictor |
| 7 | Total Calls Made | Cust: Total # Of Calls Made | Total_calls_made | calls | Quantitative | Numeric | Count, Cardinal | Input | Predictor |
| Pts | Criteria | Pts | Criteria | Pts | Criteria |
|---|---|---|---|---|---|
| 15 | EDA (Check for comments after every step) | 15 | Hierarchical Clustering | 05 | Clusters Comparison |
| 10 | KMeans | 05 | Silhouette Score | 10 | Analysis of the Clusters Formed |
D1. Univariate Analysis : 10 Marks + 5 Marks (Code Comments, Explanations, Conclusions)
D2. Execute K Means Clustering : 10 Marks
D3. Execute Hierarchical Clustering : 15 Marks
D4. Calculate Average Silhouette Score for Both Methods : 5 Marks
D5. Compare K Means Clusters with Hierarchical Clusters : 5 Marks
D6. Analysis of the Clusters Formed : 10 Marks
Load Libraries : ⬇
# v====== Standard Libraries Begin ======v #
import warnings
warnings.filterwarnings('ignore')
# import pylab as pl # Mukesh Rao # Not Needed: np & plt take care of it. Causes problems in suppressing plot info / data outputs
import seaborn as sns # Data visualization for statistical graphics
import numpy as np # Numerical Python libraries
# random_state = np.random.RandomState(0) # From Mukesh Rao. MSB: Do we need to do this? Working ok without it.
import matplotlib.pyplot as plt # Data visualization for Plotting
import matplotlib.image as mpimg # To handle plot images, diskfile save, retrieve, render/display
from matplotlib import cm # Color Maps
from mpl_toolkits.mplot3d import Axes3D # MSB: For 3D plots: 5 USL KMeans Case Study Mr.Tom
%matplotlib inline
import pandas_profiling
import pandas as pd # to handle data in form of rows and columns
from pandas import ExcelWriter # Outputs an Excel disk file
from sklearn.svm import SVC # M.Rao; SVC = Support Vector Classification
from sklearn import metrics, svm # "svm" = Support Vector Machine > MRao; For Lin/Log Regr, DTree
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.utils import resample, shuffle # "shuffle"=> Mukesh Rao; Bagging Sample data set creation
from sklearn.model_selection import train_test_split, KFold, cross_val_score, LeaveOneOut, GridSearchCV, RandomizedSearchCV # Lin/LogR, DTree
from sklearn.pipeline import Pipeline, make_pipeline # M.Rao
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor # MRao, Tom
from sklearn.cluster import KMeans, AgglomerativeClustering # For 5 UL KMeans, Hierarchical
# For Linear Dimensionality (Cols/Attributes) Reduction to a Lower dimensional space (eg: reduce 15 cols to 2 cols):
from sklearn.decomposition import PCA # 5 UL : "Principal Component Analysis" for "Singular Value Decomposition" (SVD)
# ClusterCentroids=Cluster based UNDERsampling, TomekLinks=Under sampling by Deleting nearest majority neighbor/similar rows
from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids, TomekLinks
from imblearn.over_sampling import SMOTE # Over sampler
from imblearn.combine import SMOTETomek # OVER / UP Sampling followed by UNDER / DOWN Sampling
from mlxtend.feature_selection import SequentialFeatureSelector as sfs # For Features selection
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs # For Plotting
# ====== For Linear Regression ======
from scipy.stats import zscore, pearsonr, randint as sp_randint # For LinReg
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage, fcluster # For 5UL Hierarchical
from scipy.spatial.distance import cdist, pdist # pdist = pairwise distance between datapoints : for 5UL KMeans, Hierarchical
from category_encoders import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures, binarize, LabelEncoder, OneHotEncoder # M.Rao
from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, silhouette_score # LinRegr, USL: HierA Cluster
import statsmodels.api as sm # For OLS Summary in Linear Regression
import statsmodels.formula.api as smf # For OLS Summary in Linear Regression
from yellowbrick.regressor import ResidualsPlot
from yellowbrick.classifier import ClassificationReport, ROCAUC
# ====== For Logistic Regression ======
from sklearn.metrics import confusion_matrix, recall_score, precision_score, accuracy_score, \
f1_score, roc_curve, roc_auc_score, classification_report, auc # Mukesh Rao
# ====== For Decision Tree ======
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
# from sklearn.externals.six import StringIO # MSB: Discontinued in Scikit Version 0.23 (available only upto ver 0.22)
import pydotplus as pdot # to display decision tree inline within the notebook
import graphviz as gviz
# DTree does not take strings as input for the model fit step.
from sklearn.feature_extraction.text import CountVectorizer
# ======= For Ensemble Techniques =======
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, \
BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
# ======= Utilities =======
import os, datetime, time, pickle # For model export
from os import system # For system commands from within python (eg "ls -l")
# ======= Set default style ========
# Multiple output displays per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython.display import Image, Markdown
from IPython.display import display, HTML
display(HTML("<style>.container { width:98% !important; }</style>")) # Increase cell width
# ===== Options =====
pd.options.display.float_format = '{:,.2f}'.format # Remove scientific notations to display numbers with 2 decimals
pd.set_option('display.max_columns', 100) # Max df cols to display set to 100.
pd.set_option('display.max_rows', 50) # Max df rows to display set to 50.
# pd.set_option('display.max_rows', tdf.shape[0]+1) # just one row more than the total rows in df
# Update default style and size of charts
# plt.figure(figsize=(12,8))
plt.style.use('ggplot') # plt.style.use('classic') ??
plt.rcParams['figure.figsize'] = [10, 8]
sns.set_style(style='darkgrid')
# ====== Standard Libraries End ======^ #
Function Definition : ⬇
# BEGIN: Function describeX(): Manoj S Bhave: My Original Code Work. Package & Publish on PyPi Python Package Index / Python Library, etc...
# Future: Add Params: Dup.Rows = 'y/n', Dup.Cols = 'y/n' (for speed), #Modes to Display = n (0,1,2,3..n), ...
# ... ColType = 'all/num/obj' (All / Numeric / Object), Display Sample Row = True/False, etc...
# Pending: Code Comments...
def describeX(df, dcols = False, srows='', outl='iqr'):
"""
Input : dataframe, parameters
Output : 1. dataframe via display() 2. strings via print()
Usage : describeX(dataframe)
Author : Manoj S Bhave
Version : 0.1 Good To Go!
Function "describeX()" extends the output of the regular pd.df.describe() features with various other basic df stats features \
in a single output for convenience:
Inputs: dataframe (df)
Params:
1. dcols = False (default). If True, detects Duplicate COLUMNS in df. Can be SLOW for large df: If df.size > 1M, then: NOT Computed!
2. srows = '' (default). When specified, calls function showdfQ() to display df rows with various row selection criteria.
3. outl = 'iqr' (default). When specified, calls function showOutL() to show Outliers in all columns.
NOTE: More param info, Shift+Tab on functions: showdfQ(), showOutL()
Outputs: Extended Descriptive Statistics in a Unified Compact Tabular Snapshot for all Numeric and/or Object type df columns as \
applicable. These extended / additional Features are as follows:
* "unq" : Unique values count * "null" : Null values count
* "zero" : Zero values count * "skew" : Skewness of column data distribution
* "OutL" : Number of Outliers Lower/Less than Q1-IQR*1.5 of column
* "OutH" : Number of Outliers Higher/Greater than Q3+IQR*1.5 of column
* "mode" : Mode value of col, displayed when UniMode ONLY
* "mode1" : 1st Mode value of col, displayed when MultiMode ONLY
* "mode2" : 2nd Mode value of col, displayed when MultiMode ONLY
* "modeN" : Total NUMBER of modes for the col, displayed when MultiMode ONLY
NOTE: Limited Output: Mode: When MultiMode, displays ONLY the first TWO Modes for Simplicity. Indicates existence of multiple modes for a column.
Dataframe Level Overall Features: Displayed at the bottom of above Stats Table:
* "Rows" : Number of Rows in df * "Cols" : Number of Columns in df
* "DUP.Rows" : Duplicate Rows in df * "DUP.Cols" : df.size>1M=NOT Computed!
* "Elements" : Number of values in df (rows x cols)
* "Outliers" : Method to detect outliers' count in "OutL", "OutH" cols.
"""
dfs = pd.DataFrame()
dfs['cols'] = df.columns
dfs = dfs.set_index(keys=['cols'])
dfs['dtype'] = df.dtypes
dfs = dfs.join(df.describe(include='all').T)
drop_cols = ['count']
if 'unique' in dfs.columns:
drop_cols = ['count', 'unique']
dfs.drop(drop_cols, axis=1, inplace=True)
dfs['unq'] = df.nunique()
dfs['null'] = df.isna().sum()
dfs['zero'] = (df == 0).sum()
dfs['<0'] = (df.select_dtypes(include='number') < 0).sum()
dfs['skew'] = df.skew()
# Final: Outliers: Works OK!
outlrs = []
for col in df.select_dtypes(include='number'):
q1 = round(df[col].quantile(.25),2)
q3 = round(df[col].quantile(.75),2)
otr = round((q3-q1)*1.5, 2)
otl = round(q1 - otr, 2)
oth = round(q3 + otr, 2)
otls = (df[col] < otl).sum()
oths = (df[col] > oth).sum()
outlrs.append((col, otls, oths))
outlrs_df = pd.DataFrame(outlrs)
outlrs_df = outlrs_df.set_index(keys=[0])
dfs['OutL'] = outlrs_df[1]
dfs['OutH'] = outlrs_df[2]
# Final: Mode: It Works OK!
tmp_df = df.mode().T
if len(tmp_df.columns) > 1:
dfs['mode1'] = tmp_df[0]
dfs['mode2'] = tmp_df[1]
dfs['modeN'] = tmp_df.T.count()
else:
dfs['mode'] = tmp_df[0]
print('* DF Stats For All Columns:')
dfs.reset_index(inplace=True)
display(dfs)
dupcols = 'dcols=False'
if dcols:
if df.size < 1000000:
dupcols = df.T.duplicated().sum()
else:
dupcols = 'n/a:df.size>1M'
print('* Rows:', df.shape[0], ' * Cols:', df.shape[1], ' * Elements:', df.size,
' * DUP.Rows:', df.duplicated().sum(), ' * DUP.Cols:', dupcols,
          ' * Outliers: OutL < Q1-IQRx1.5, OutH > Q3+IQRx1.5') #' * Type:', type(df)
if srows != '': showdfQ(df, srows)
# End: Function describeX().
# BEGIN: Function showOutL(): Manoj S Bhave: My Original Code Work. Package & Publish on PyPi Python Package Index / Python Library, etc...
# Future: Add Params: Activate 'zscore', Outlier_Update_Method = zscore < 0.3, iqr, inside_whisker_end, Q1/Q3 mean, median, mode, KNN,
def showOutL(df, mthd='iqr', impute=False):
"""
Input : dataframe, parameters
Output : Lists Column wise Outliers via print()
Usage : showOutL(dataframe, params)
Author : Manoj S Bhave
Created : Tue.Jul.21 2020
Version : 0.1 (Good To Go!)
Function "showOutL" Detects and Displays all Outliers, based on IQR Method, for all Numeric columns. For each column it displays \
Total Outlier count (Low + High), Separate Low & High counts, Low & High Whisker values, Q1 & Q3 values as potential Imputing values.
Inputs: dataframe (df)
Params:
1. 'mthd' = 'iqr' (default): Method to be used for Outlier Detection. Other Method value: 'zscore' (Not yet Implemented)
2. 'impute' = False (default): Outliers Display only (No Update). When True, it updates the Outlier value with applicable Q1 / Q3 values \
for each Column, for all Rows. Then to check & verify Imputations, it reruns itself in 'impute' = False mode to Update & Display \
the New metrics based on New Imputed values for verification.
IQR Method Used: LOW Whisker End = Q1 - (IQR * 1.5); HIGH Whisker End = Q3 + (IQR * 1.5) are calculated for a Column. Column values \
that are Below (Less Than) the Low End and/or Col values that are Above (Greater Than) the High End are identified as "Outliers" for \
"Display Only" or for "Update" based on param value 'impute' = False/True.
NOTE: Limitation: Outlier values are updated with Q1 and Q3 values only. Other values can be used like 'Mean', 'Median', 'Mode', etc.
Outputs:
1. When 'impute' = False: Lists all Outliers for each Numeric Col as mentioned above with Total (Low + High) Outliers counts along \
with individual Low & High subcounts. Displays Low & High cut off Whisker end values along with Q1 & Q3 values for each Col resp.
2. When 'impute' = True: Lists the completion status collectively & individually for each Col along with Total values Imputed for \
each Col. After completion, this function ReRuns itself (without Further Imputation) to ReDetect any New Outliers to check, verify, \
update the new Metrics due to the newly updated Outlier values.
NOTE: If New Outliers are noticed, then if needed, this process can be Rerun in 'impute' = True to fix these new outliers. Process can \
be repeated till it makes business sense to live with some marginal outliers or not without compromising too much on information loss \
due to 'fixed' Col values as a result of Outlier Elimination process.
"""
# Detect & Display All Outliers for All Numeric Cols with IQR Method and Impute with respective columns' Q1 or Q3 values as applicable:
print('\n* Outliers Detection: IQR Method: For All Numerical Columns: LOW Whisker = Q1 - (IQR * 1.5); HIGH Whisker = Q3 + (IQR *1.5): \n')
for col in df.select_dtypes(include='number'):
q1 = round(df[col].quantile(.25),2)
q3 = round(df[col].quantile(.75),2)
otr = round((q3 - q1) * 1.5, 2)
otl = round(q1 - otr,2)
oth = round(q3 + otr,2)
otls = (df[col] < otl).sum()
oths = (df[col] > oth).sum()
if impute:
df[col] = np.where((df[col] < otl), q1, df[col])
df[col] = np.where((df[col] > oth), q3, df[col])
print('Col:', col, otls+oths, 'Values Imputed')
elif (otls + oths) != 0:
print('\n* Column:', col, ': Total', otls+oths, 'Outlier Values:')
if otls != 0:
print(' ', str(otls).rjust(3, ' '), 'outliers Under Low Whisker =', str(otl).rjust(7, ' '), ': Impute with Q1 value:', q1)
if oths != 0:
print(' ', str(oths).rjust(3, ' '), 'outliers Over High Whisker =', str(oth).rjust(7, ' '), ': Impute with Q3 value:', q3)
else: print('\n* Column:', col, ': Total', otls+oths, 'Outlier Values: NO OUTLIERS!')
if impute: # ReRun to get new matrix after Outlier fix to display the New Outlier count:
        print('\nImpute Complete by IQR Method: Check to verify Outliers Elimination')
showOutL(df, impute=False)
# End: Function showOutL().
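The IQR rule implemented by showOutL() can be tried in isolation. A minimal, standalone sketch (the toy series and variable names below are illustrative, not from the notebook's data):

```python
import numpy as np
import pandas as pd

# Toy series with one obvious high outlier:
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

q1, q3 = s.quantile(.25), s.quantile(.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker ends

# Cap values outside the whiskers with Q1 / Q3, as showOutL(impute=True) does:
capped = s.where(s.between(low, high), np.where(s < low, q1, q3))
print(low, high)        # -2.5 11.5
print(capped.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 6.25]
```

The 100 lands above the high whisker (11.5) and is replaced by Q3 (6.25), mirroring the column-wise imputation loop in the function above.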
# BEGIN: Function showdfQ(): Manoj S Bhave: My Original Code Work. Package & Publish on PyPi Python Package Index / Python Library, etc...
# Future: Add Params: ???
def showdfQ(df, srows='m5'):
"""
Input : dataframe, parameters
Output : dataframe rows via display()
Usage : showdfQ(dataframe, params)
Author : Manoj S Bhave
Created : Tue.Jul.21 2020
Version : 0.1 (OK To Go!)
Function "showdfQ()" displays a section of dataframe (df) rows as selected by the given params in a single convenient output. The most \
important functionality is its ability to show rows from around the Middle of df and the Middles of Q1 and Q3, which is not easily \
available in packaged form as a function() or a df.attribute()
Inputs: dataframe (df)
Params: srows = 'hmtrqQ9' : Optional. Default = 'm5': Middle 5 rows shown
Each letter in the 'srows' param string is a separate parameter (max seven) and can be specified in any combination thereof. \
It takes only the First Seven characters of the param string; the rest of the chars are ignored. Also, it considers only the ONE digit \
that occurs First in the param string; the rest of the digits are ignored:
'h' : df rows from Head = df.head()
't' : df rows from Tail = df.tail()
'r' : df rows Random = df.sample()
'm' : df rows from around Middle of df
'q' : df rows from around middle of Q1
'Q' : df rows from around middle of Q3
0..9: df rows to display (1..9) per specified criteria above. Optional. Default = '0'. Display 10 rows (Max) when '0'
NOTE: Limitations:
A. Minimum row count in the input df MUST be at least 20, otherwise when len(df) < 20 this function will not produce any output. No Show!
B. When srows is '' or does NOT contain any of the VALID chars above, then this function does not produce any output. No Show!
Outputs: Data rows from the specified dataframe (df) are displayed as per the specified params above. If all or multiple criteria \
(letters in params) are specified then the order of dataset display is as follows: (whichever is applicable based on params):
1. 'h' Head (0 thru n)
2. 'q' Q1
3. 'm' Middle
4. 'Q' Q3
5. 't' Tail
6. 'r' Random
NOTE: Limitation: The Output is limited to Max 10 (Ten) rows in any case / combination of chars within the 'srows' param string.
"""
if len(df) < 20: srows = ''
if len(srows) > 7: srows = srows[0:7]
    srows # Echo param value (displayed via InteractiveShell "all" setting)
nrows = 10
for i in srows:
if i.isdigit():
nrows = int(i)
break
if nrows < 1 or nrows > len(df) : nrows = 10
    nrows, df.shape[0] # Echo values (displayed via InteractiveShell "all" setting)
mid = len(df)//2
low = mid-nrows//2
high = mid+1+nrows//2
qmid = len(df)//4
qlow = qmid-nrows//2
qhigh = qmid+1+nrows//2
Qmid = mid + qmid
Qlow = Qmid-nrows//2
Qhigh = Qmid+1+nrows//2
print('')
if 'h' in srows: display('Showing Head ' + str(nrows) + ' Rows:', df.head(nrows))
if 'q' in srows: display('Showing Q1 Middle ' + str(nrows) + ' Rows:', df.iloc[qlow : qhigh])
if 'm' in srows: display('Showing Middle ' + str(nrows) + ' Rows:', df.iloc[low : high])
if 'Q' in srows: display('Showing Q3 Middle ' + str(nrows) + ' Rows:', df.iloc[Qlow : Qhigh])
if 't' in srows: display('Showing Tail ' + str(nrows) + ' Rows:', df.tail(nrows))
if 'r' in srows: display('Showing Random ' + str(nrows) + ' Rows:', df.sample(nrows))
# End: Function showdfQ().
# Read & Load the input Datafile into Dataset frame: Customer DataFrame:
cdf = pd.read_excel('ALL Life Bank Credit Card Customer Data.xlsx')
cdf
My Housekeeping: Incremental DATA Backup 0: ⬇
# My Housekeeping: Incremental DF Data Backup as of now:
Markdown("### Incremental DF Data Backup 0")
cdf0 = cdf.copy() # Original Df
cdf.to_csv('cdf0.csv') # Also export as .csv file to disk
# Verify backup copy
! ls -l cdf*
cdf0.shape, type(cdf0)
cdf0.sample(7)
# Rename column names for convenience and/or meaningfulness:
cdf.head(1)
cdf.rename(columns={ 'Sl_No' : 'sn' , 'Customer Key' : 'cstid' , 'Avg_Credit_Limit' : 'crlmt' , 'Total_Credit_Cards' : 'crcrds' ,
'Total_visits_bank' : 'bvisits' , 'Total_visits_online' : 'evisits' , 'Total_calls_made' : 'calls' },
inplace=True, errors='raise')
cdf.head(1)
My Housekeeping: Incremental DATA Backup 1: ⬇
# My Housekeeping: Incremental DF Data Backup as of now:
Markdown("### Incremental DF Data Backup 1")
cdf1 = cdf.copy() # DF Modified: Changed col names
cdf.to_csv('cdf1.csv') # Also export as .csv file to disk
# Verify backup copy
! ls -l cdf*
cdf1.shape, type(cdf1)
cdf1.sample(7)
My Housekeeping: Incremental CODE Backup 1: ⬇
### My Housekeeping: Incremental Jupyter Code File Backup 1 as of now ^^^
Markdown("### Incremental Jupyter Notebook Code Backup 1")
! cp "Project 5 USL KMC & HCluster ALL Life Bank.ipynb" \
"Project 5 USL KMC & HCluster ALL Life Bank 1.ipynb"
! ls -l Project*.ipynb
profile = pandas_profiling.ProfileReport(cdf)
profile
Actions From the Above Profile Report:
Observations From the Above Profile Report:
# As Observed & Determined above: Showing 10 rows below, with 5 Duplicate Pairs of 'cstid' values: Partial Duplicate Rows kept as they are:
# These 'cstid' (Customers) have multiple entries (records/observations):
# BUT with different attribute/col values for: 'crlmt', 'crcrds', 'bvisits', 'evisits', 'calls'
# As determined above, these are "partial" duplicate rows and NOT TRUE 'full' duplicates, so we will NOT 'combine' NOR DROP them.
cdf[(cdf['cstid'].duplicated(keep=False))].sort_values(by=['cstid'])
# Action Item(s) as Determined above: DROP column 'sn' (Serial Number), 'cstid' (Customer ID):
cdf.drop(['sn', 'cstid'], axis = 1, inplace = True)
cdf.head()
My Housekeeping: Incremental DATA Backup 2: ⬇
# My Housekeeping: Incremental DF Data Backup as of now:
Markdown("### Incremental DF Data Backup 2")
cdf2 = cdf.copy() # DF Modified: DROPPED cols 'sn', 'cstid'
cdf.to_csv('cdf2.csv') # Also export as .csv file to disk
# Verify backup copy
! ls -l cdf*
cdf2.shape, type(cdf2)
cdf2.sample(7)
# Custom Function "describeX" is an extension of regular df.describe(). It includes these additional features for all df columns:
# * "unq" = Unique values count * "null" = Null values count * "zero" = Zero values count * "skew" = Skewness of col data distribution.
# * "mode1" = Mode value of col. * "mode2" = 2nd Mode value of col, if it exists (Max). * "modeN" = Total NUMBER of modes for col.
# * "OutL" = Number of Outliers Lower/Less than Q1-IQR*1.5 of col. * "OutH" = Number of Outliers Higher/Greater than Q3+IQR*1.5 of col.
#
# * "Rows" : Number of Rows in df * "Cols" : Number of Columns in df
# * "DUP.Rows" : Duplicate Rows in df * "DUP.Cols" : Duplicate Columns in df
# * "Elements" : Number of values in df (rows x cols)
# * "Outliers" : Note on "OutL", "OutH" cols in above Stats Table
# Function definition is at the top, after libraries. Written by self (Manoj S Bhave) and to be published on PyPi, Python Libs, etc...
# For more details use Shift+Tab on this function below:
describeX(cdf, dcols=True, srows='m0')
My Housekeeping: Incremental CODE Backup 2: ⬇
### My Housekeeping: Incremental Jupyter File Code Backup 2 ⬆
Markdown("### Incremental Jupyter Notebook Code Backup 2 ⬆")
! cp "Project 5 USL KMC & HCluster ALL Life Bank.ipynb" \
"Project 5 USL KMC & HCluster ALL Life Bank 2.ipynb"
! ls -l Project*.ipynb
Observations from the df stats above:
| Degree | Lower | Upper | Type | Comments |
|---|---|---|---|---|
| High | < −1.0 | > +1.0 | Asymmetric | Outside of Lower & Upper |
| Moderate | −1.0 to −0.5 | +0.5 to +1.0 | Asymmetric | Within Lower OR Upper |
| Low | −0.5 | +0.5 | Asymmetric | Within Lower AND Upper |
| Very Low | −0.25 | +0.25 | Symmetric | Within Lower AND Upper |
| No Skew | ~0.0 | ~0.0 | Symmetric | Almost zero (+ or −) |
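The skewness table above maps directly to a small classifier helper (a hedged sketch; the name `skew_degree` is an assumption, only the thresholds follow the table):

```python
# Illustrative helper: map a skew value to the degrees in the table above.
def skew_degree(s):
    a = abs(s)           # table is symmetric about zero, so use |skew|
    if a > 1.0:
        return 'High'
    if a > 0.5:
        return 'Moderate'
    if a > 0.25:
        return 'Low'
    if a > 0.0:
        return 'Very Low'
    return 'No Skew'

print(skew_degree(1.2), skew_degree(-0.7), skew_degree(0.3))  # High Moderate Low
```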
ACTION Items From The Above Observations: Impute Outliers for 'crlmt', 'evisits' which are beyond the upper/higher whisker: List their values:
# Display & Impute Outliers for cols 'crlmt', 'evisits' as determined in ACTION item(s) above:
# Custom Function showOutL() Detects, Display and/or Imputes Outliers by IQR method.
# Function definition is at the top, after libraries. Written by self (Manoj S Bhave) and to be published on PyPi, Python Libs, etc...
# For more details use Shift+Tab on this function below:
# Now Use the Custom Function showOutL() in "Display Only" mode (NO Impute):
showOutL(cdf) # list Outliers
p = plt.figure(figsize=(20,7))
g = cdf.boxplot() # Visualize Outliers
# Now Remove/Impute these Outliers by IQR Method: 'crlmt' and 'evisits' 39 and 37 Outliers respectively:
# Now Use the Custom Function showOutL() in "Impute" and then "Display" mode:
showOutL(cdf, mthd='iqr', impute=True) # For more details Shift+Tab on Function Name or see Function Definition at the Top
p = plt.figure(figsize=(20,7))
g = cdf.boxplot() # Visualize Outliers after Imputations
# Verify that the Outlier for cols 'crlmt', 'evisits' have been removed and No New Outliers were created due to Imputations:
describeX(cdf, dcols=True)
My Housekeeping: Incremental DATA Backup 3: ⬇
# My Housekeeping: Incremental DF Data Backup 3:
Markdown("### Incremental DF Data Backup 3")
cdf3 = cdf.copy() # DF Modified: Imputed 'cstlmt' (39), 'evisits' (37)
cdf.to_csv('cdf3.csv') # Also export as .csv file to disk
# Verify backup copy
! ls -l cdf*
cdf3.shape, type(cdf3)
cdf3.sample(5)
My Housekeeping: Incremental CODE Backup 3: ⬇
### My Housekeeping: Incremental Jupyter File Code Backup 3 ⬆
Markdown("### Incremental Jupyter Notebook Code Backup 3 ⬆")
! cp "Project 5 USL KMC & HCluster ALL Life Bank.ipynb" \
"Project 5 USL KMC & HCluster ALL Life Bank 3.ipynb"
! ls -l Project*.ipynb
As per the Imputation Report, BoxPlot, and describeX Stats Table above: All Outliers Eliminated ('crlmt', 'evisits'). No New Outliers Created.
However, 22 New Duplicate Rows have emerged. Originally there were only 5 dup rows. After dropping 'cstid' we got six (6) additional dup rows, totalling 11 dup rows. After imputing outliers we got 11 more dup rows, totalling 22 dup rows.
This is inevitable as we drop features and/or change col values via imputation, etc. Since this number is not significant, 22 rows out of 660 total rows (3.33%), we will leave these dup rows as they are.
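The duplicate-row growth described above can be reproduced on a toy frame (a hedged, standalone sketch; the data is made up, only the column names follow the notebook):

```python
import pandas as pd

# Made-up toy frame; unique IDs mask the repetition in the other columns:
df = pd.DataFrame({
    'cstid':  [101, 102, 103, 104],
    'crcrds': [2, 2, 2, 5],
    'calls':  [4, 4, 4, 9],
})

before = df.duplicated().sum()                        # IDs keep all rows distinct
after = df.drop(columns='cstid').duplicated().sum()   # dropping the ID exposes duplicates
print(before, after)  # 0 2
```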
From the BoxPlot, Data Stats and the Data Sample displayed above, it is evident that there is a major disparity in the Unit Of Measure among the features / variables. Column 'crlmt' is represented in Currency (money values) going up to 6-digit numbers, while the remaining four (4) columns are plain number counts of up to 2 digits. Hence we MUST scale the dataset to bring all column values to a common level field / scale. We will use ZSCORE for scaling:
# Individual Histogram of ALL 5 Numerical Columns: Visual Distribution of column values:
p = plt.figure(figsize=(20, 20))
pos = 1
for col in cdf.columns:
p = plt.subplot(3, 3, pos)
g = sns.distplot(cdf[col]) # For Each of the 5 Cols in the df
pos += 1
p = plt.subplot(3, 3, pos)
g = sns.distplot(cdf) # For the entire dataset df
Observations: From the Distribution Plots above:
Markdown('### * Correlation Matrix:')
cdf.corr()
Markdown('### * Correlation HeatMap Matrix:')
p = plt.figure(figsize=(11, 7))
g = sns.heatmap(cdf.corr())
Correlation Observations From above matrix:
Since these are NOT Highly Correlated (e.g. > 0.75 or so), we will NOT DROP these columns. Also there are not too many cols.
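As a sanity check for a >0.75 rule of thumb, a small helper (illustrative, not part of the notebook's pipeline) can list the column pairs that would cross such a cutoff:

```python
import pandas as pd

def high_corr_pairs(df, cutoff=0.75):
    """List column pairs whose absolute Pearson correlation exceeds cutoff."""
    c = df.corr().abs()
    cols = c.columns
    return [(cols[i], cols[j], round(c.iloc[i, j], 2))
            for i in range(len(cols))
            for j in range(i + 1, len(cols))
            if c.iloc[i, j] > cutoff]

# Toy frame: 'b' is an exact linear function of 'a', so only that pair is flagged:
demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
print(high_corr_pairs(demo))  # [('a', 'b', 1.0)]
```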
Markdown('##### * PairPlot BiVariate Study among Predictor variables with Density Curves (diag_kind="kde")')
g = sns.pairplot(cdf, diag_kind='kde')
From the PairPlot above we reconfirm our earlier conclusions that:
# Additional Exhibit 1:
# Density, Histogram, Scatter plots of All Attributes Interactions:
# Different Upper & Lower triangles: eg. Scatter & KDE (Density)
Markdown('### * PairGrid BiVariate Study among Predictor variables')
Markdown('* <u> Upper Half</u> : Linear Regression Fitted line thru Scatter Plot')
Markdown('* <u> Lower Half</u> : Kernel Densities among Attributes')
g = sns.PairGrid(cdf)
g = g.map_upper(sns.regplot)
g = g.map_lower(sns.kdeplot)
g = g.map_diag(plt.hist, lw=2)
plt.show()
The Density PairGrid above ("island"-like structures) indicates data concentrations and their spread. The "islands" potentially indicate groups / clusters. Two or three clusters seem most consistent in the dataset. We will further verify that as we progress into KMeans and Hierarchical Clustering.
We will now scale the entire dataset using ZSCORE, as determined earlier, due to the fact that the variables are on mixed Units Of Measure and the numbers vary greatly in size (magnitude). For e.g. 'crlmt' goes up to 6-digit numbers while all other variables go up to 2-digit numbers. So all variables need to be brought to a common scale, which also gets rid of the individual Units Of Measure.
cdf = cdf.apply(zscore)
cdf.head()
describeX(cdf, dcols=True, srows='r0')
My Housekeeping: Incremental DATA Backup 4: ⬇
# My Housekeeping: Incremental DF Data Backup 4:
Markdown("### Incremental DF Data Backup 4")
cdf4 = cdf.copy() # DF Modified: 1. Imputed 'cstlmt' (39), 'evisits' (37) ; 2. SCALED with ZSORE
cdf.to_csv('cdf4.csv') # Also export as .csv file to disk
# Verify backup copy
! ls -l cdf*
cdf4.shape, type(cdf4)
cdf4.sample(6)
p = plt.figure(figsize=(20,7))
g = cdf.boxplot() # Visualize Outliers after Imputations
My Housekeeping: Incremental CODE Backup 4: ⬇
### My Housekeeping: Incremental Jupyter File Code Backup 4 ⬆
Markdown("### Incremental Jupyter Notebook Code Backup 4 ⬆")
! cp "Project 5 USL KMC & HCluster ALL Life Bank.ipynb" \
"Project 5 USL KMC & HCluster ALL Life Bank 4.ipynb"
! ls -l Project*.ipynb
# Find Optimum Number of Clusters: # Note: Data is SCALED with zscore and NO Outliers (i.e. Data is "Conditioned"):
# Now Create Clusters:
cluster_range = range(1,15)
cluster_errors = []
for k_clusters in cluster_range:
clusters = KMeans(k_clusters, n_init = 5)
clusters.fit(cdf) # Data Scaled with zscore
labels = clusters.labels_
centroids = clusters.cluster_centers_
cluster_errors.append(clusters.inertia_)
clusters_df = pd.DataFrame({"k_clusters": cluster_range, "cluster_errors": cluster_errors})
clusters_df[0:15]
# Now do the "Elbow Plot": Very IMP in helping us decide the Optimal Number Of Clusters:
plt.figure(figsize=(12,6))
plt.plot( clusters_df.k_clusters, clusters_df.cluster_errors, marker = "o" )
The "Big Bend" is at "4" ⬆ in the above Elbow Plot, which indicates that 4 is the Optimal Number of Clusters
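The elbow read off visually above can also be approximated programmatically. A hedged sketch, where the 20% relative-improvement threshold is an assumption, not from the notebook:

```python
# Pick the "elbow" as the last k whose inertia drop is still large.
def elbow_k(ks, inertias, min_rel_drop=0.20):
    for i in range(1, len(inertias)):
        rel_drop = (inertias[i - 1] - inertias[i]) / inertias[i - 1]
        if rel_drop < min_rel_drop:
            return ks[i - 1]   # previous k gave the last worthwhile improvement
    return ks[-1]

# Toy inertia curve with a clear bend at k=4:
print(elbow_k([1, 2, 3, 4, 5, 6], [1000, 600, 350, 200, 180, 170]))  # 4
```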
# Since the Sharp / Big Bend is at 4 (Optimal Clusters), let's create 4 Clusters:
kmeans = KMeans(n_clusters=4, n_init = 5, random_state=6)
kmeans.fit(cdf)
# Number of data items in each cluster:
labels = kmeans.labels_
counts = np.bincount(labels[labels>=0])
print(counts)
# In the 3rd and 4th clusters above, the data distribution looks uneven.
# Let's create 3 clusters now. Maybe the last two clusters will merge and give us a better data distribution among the clusters
kmeans = KMeans(n_clusters=3, n_init = 10, random_state=6)
kmeans.fit(cdf)
# Get the Number of data items in each cluster:
labels = kmeans.labels_
counts = np.bincount(labels[labels>=0])
print(counts)
⬆ The data distribution above appears quite uneven across the clusters produced by kmeans for both n_clusters settings (k=3 & k=4). Hence let's try some feature engineering on the columns:
⬇ Tried running kmeans on various combinations of features by merging various features/columns as follows (outputs of each kmeans run per combination have been removed to reduce notebook clutter):
# * Tried various combinations of two of these three cols bvisits, evisits, calls at a time:
# * Combine bvisits + evisits into one col 'bevisits' :
# cdf_combo = pd.DataFrame(cdf.crlmt)
# cdf_combo['crcrds'] = cdf.crcrds
# cdf_combo['bevisits'] = cdf.bvisits + cdf.evisits
# cdf_combo['calls'] = cdf.calls
# cdf_combo
# cdf_combo = cdf.bvisits + cdf.bvisits
# * Combine evisits + calls in to one columns...etc. (Code removed to reduce clutter)
# * Combine bvisits + evisits + calls into one col 'xvisits' :
# cdf_combo = pd.DataFrame(cdf.crlmt)
# cdf_combo['crcrds'] = cdf.crcrds
# cdf_combo['xvisits'] = cdf.bvisits + cdf.evisits + cdf.calls
# cdf_combo
# cdf_combo = cdf.copy()
# cdf_combo.pop('calls')
# cdf_combo
# cdf_combo = pd.DataFrame(cdf.crlmt)
# cdf_combo['crcrds'] = cdf.crcrds
# cdf_combo
# cdf_combo = cdf3.copy() # UnSCALED !!!
# np.log(cdf_combo.crlmt)
# cdf_combo.crlmt = np.log(cdf_combo.crlmt)
# cdf_combo['xvisits'] = cdf.bvisits + cdf.evisits + cdf.calls
Experiment: Ran kmeans on the various combinations above for various values of n_clusters (k = 3 thru 8) to produce various numbers of Clusters.
Conclusion: Best (most even / balanced) data distribution across clusters was obtained :
ACTION: FINAL: Considering points above, we will hence proceed with :
# Let's prepare the data as per the Final Action decided above:
cdf_combo = pd.DataFrame(np.log(cdf3.crlmt)) # This col 'crlmt' is Log SCALED !!!
cdf_combo['crcrds'] = cdf3.crcrds # This col 'crcrds' is UnSCALED !!!
cdf_combo['xvisits'] = cdf3.bvisits + cdf3.evisits + cdf3.calls # These three cols are UnSCALED !!!
print('\n* cdf_combo df data:')
cdf_combo
p = plt.figure(figsize=(20,7))
g = cdf_combo.boxplot() # After Log Scaling column 'crlmt' only (only 1 column scaled); Plot to check distribution
showOutL(cdf_combo) # Check this new df for any Outliers
describeX(cdf_combo) # get df vital stats for this new df
print('\n* See Above^: The Min & Max values lie between 1 & 16. The Mean & SD values are not too far away from each other.\n')
# Find Optimum Number of Clusters: # Note: Data is SCALED with zscore and NO Outliers (i.e. Data is "Conditioned"):
# Now Create Clusters:
cluster_range = range(1,15)
cluster_errors = []
for k_clusters in cluster_range:
clusters = KMeans(k_clusters, n_init = 5, random_state=6)
clusters.fit(cdf_combo)
labels = clusters.labels_
centroids = clusters.cluster_centers_
cluster_errors.append(clusters.inertia_)
clusters_df = pd.DataFrame({"k_clusters": cluster_range, "cluster_errors": cluster_errors})
clusters_df[0:15]
# Now do the "Elbow Plot": Very IMP in helping us decide the Optimal Number Of Clusters:
plt.figure(figsize=(12,6))
plt.plot( clusters_df.k_clusters, clusters_df.cluster_errors, marker = "o" )
⬆ Box Plot: Data distribution looks OK in the dataframe after Scaling and Feature Engineering
⬆ Elbow Plot: Best Elbow is still at "4" (Four)
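The visual elbow can be cross-checked numerically: the elbow is where the inertia curve bends sharpest, i.e. where its discrete second difference is largest. A sketch with hypothetical inertia values (stand-ins for the cluster_errors list computed above):

```python
import numpy as np

# Hypothetical inertia values shaped like a typical elbow curve
# (stand-ins for the cluster_errors list computed above).
inertia = np.array([1000.0, 850.0, 700.0, 400.0, 380.0, 370.0, 365.0])
ks = np.arange(1, len(inertia) + 1)

# The elbow is the k where the drop before it is large and the drop after it
# is small, i.e. where the second difference of inertia peaks.
second_diff = np.diff(inertia, n=2)          # aligned with ks[1:-1]
elbow_k = ks[1:-1][np.argmax(second_diff)]
print('elbow at k =', elbow_k)
```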
👍 This processed df is Good To Go and will proceed accordingly with further kmeans process:
# On df cdf_combo, let's create 4 Clusters:
kmeans = KMeans(n_clusters=4, n_init = 3, random_state=6)
kmeans.fit(cdf_combo)
# Number of data items in each cluster:
labels = kmeans.labels_
counts = np.bincount(labels[labels>=0])
print(counts)
sns.histplot(counts, kde=True) # Note: distplot() is deprecated/removed in newer seaborn; histplot(kde=True) is the replacement
⬆ Data items seem fairly evenly distributed across the above 4 clusters. Looks pretty much even/balanced & it's good now! 👍
# Will stay and work further with this df (cdf_combo):
# Col 'crlmt' : log scaled; Cols 'crcrd', 'xvisits' : UnScaled'
# Note: Col 'xvisits' (derived from original cols) = 'bvisits' + 'evisits' + 'calls'
# Lets check the Centers in each group/cluster
centroids = kmeans.cluster_centers_
centroid_df = pd.DataFrame(centroids, columns = list(cdf_combo) )
centroid_df.transpose()
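Since 'crlmt' was log-scaled, the centroid values in that column are in log units; np.exp inverts them back to currency for readability. A sketch with hypothetical centroid values shaped like the table above (not the notebook's actual centroids):

```python
import numpy as np
import pandas as pd

# Hypothetical centroid table in the shape produced above (crlmt is in log units).
centroid_df = pd.DataFrame({'crlmt':   [9.2, 11.5, 10.1, 8.7],
                            'crcrds':  [2.5, 8.8, 5.6, 4.1],
                            'xvisits': [12.0, 4.0, 7.5, 10.5]})

readable = centroid_df.copy()
readable['crlmt'] = np.exp(readable['crlmt'])   # invert the log scaling back to currency units
print(readable.round(0))
```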
Observations / Conclusions: Group/Cluster Analysis derived from the table above:
Observations:
Conclusions:
The above confirms what we saw in the Correlation & Pairplot: Customers with High Credit Limits / More Credit Cards have LESS Contact with the Bank. And vice versa: Customers with lower Credit Limits & fewer Cards have MORE contact with the Bank.
Group 0 Customers appear to be either New to the Bank or New / Less Experienced Credit Users (maybe younger customers, etc.), because they have LESS Credit. And since they are "New" in some ways, they tend to come into "Contact" with the Bank MORE often than the "Experienced" groups. A higher Credit Limit & Card count generally comes to Customers with a good past Credit history, i.e. those who are "Experienced" in using / handling Credit products.
Groups 1 & 2 seem to be "Experienced" Credit users who know their Bank's workings and products, hence they "Contact" the Bank LESS often than Groups 0 and 3. OR probably they know that there is NOT MUCH VALUE in the Bank's Customer Service, OR maybe they already know pretty much everything the Bank's Customer Service has to offer them.
# Add Cluster's Group / Label Number to Original dataset saved in dataframe 'cdf0':
cdf_kmGrp_labeled = cdf0.copy() # restore the original df in a new df for labeling with the newly identified kmeans groups/labels
predictions = kmeans.predict(cdf_combo)
predictions
cdf_kmGrp_labeled["group"] = predictions
cdf_kmGrp_labeled['group'] = cdf_kmGrp_labeled['group'].astype('category')
print('\n* Original unprocessed dataset below, restored in a new df "cdf_kmGrp_labeled", is now labeled in the new column "group": \n')
cdf_kmGrp_labeled.info()
cdf_kmGrp_labeled.head()
# Export this KMeans Cluster Labeled dataset df into an Excel (.xls) disk file for any further analysis:
# Note: a context manager saves/closes the file automatically (writer.save() is deprecated in newer pandas);
# the legacy .xls format needs the xlwt engine; on newer pandas prefer .xlsx (openpyxl).
with ExcelWriter('ALL Life Bank CrCrd CustData Kmeans Cluster Group Labeled.xls') as writer:
    cdf_kmGrp_labeled.to_excel(writer, 'Sheet1')
! ls -l *.xls
# Visualize the Clusters / Centers:
# Add a new column "group" to the current / processed dataset df and populate it with the kmeans-derived Cluster Group Label
cdf_combo["group"] = predictions
cdf_combo.head()
# cdf_combo.boxplot(by=cdf_combo.group) # Doesn't work (bug?) in matplotlib v3.3 (installed); needs a lower version, but I don't want to mess that up!
cdf_combo.groupby('group').boxplot(layout=(3,4), figsize=(20, 25)) # matplotlib v3.3 issue: Hence using this method for boxplot
# cdf_combo_grp = cdf_combo.groupby(cdf_combo.group) # .boxplot(layout=(3,4), figsize=(20, 25))
# cdf_combo_grp # .boxplot(layout=(3,4), figsize=(20, 25))
⬆ From the Box Plots above for each of the 4 Groups/Clusters we can reconfirm that our earlier Observations & Conclusions about these clusters were Correct 👍
# More visualizations to confirm our earlier findings about the clusters, made just after the clusters were finalized:
# Now we can better see the Groups / Clusters identified by color coded keys / legend
# We are now able to see multiple Gaussians (curves/humps) that were hidden earlier before the kmeans cluster identification.
# Gaussians indicate the Groups/Clusters that we just identified.
# We can also see the Groups/Clusters in the Scatter plots below that were hidden earlier before the kmeans cluster identification.
sns.pairplot(cdf_combo, hue='group')
# PairPlot for the Original data with kmeans Clusters Labels:
# Compare the plot below with that of PairPLot which we plotted earlier just after UniVariate Analysis, before kmeans Analysis.
# At the time, it was not easy to decode any groups with it. Now the groups are much clearer with color-coded identification.
# We can now see the multiple Gaussians (curves/humps) that were hidden earlier before the kmeans cluster identification.
# Gaussians indicate the Groups/Clusters that we just identified.
# We can also see the Groups/Clusters in the Scatter plots below that were hidden earlier before the kmeans cluster identification.
# This was made possible only with the help of the K Means Clustering technique!
sns.pairplot(cdf_kmGrp_labeled, hue='group')
# Dataset in three-dimensional Scattered Plot: More Visual Aid to Confirm our earlier findings:
# Notice the Groups/Clusters in the 3D Scatter plot below that were hidden in 2D in above plots.
ax = plt.axes(projection='3d')
zdata = cdf_combo.crlmt
xdata = cdf_combo.crcrds
ydata = cdf_combo.xvisits
ax.scatter3D(xdata, ydata, zdata, c=cdf_combo.group);
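A possible refinement of the 3D plot above: calling scatter3D once per cluster (instead of passing c=group) lets ax.legend() attach a readable label to each cluster's colour. A self-contained sketch on toy stand-in data (not the notebook's cdf_combo):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend for this sketch
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
# Toy stand-ins for cdf_combo's three features plus a 'group' label column.
x = rng.normal(size=80)
y = rng.normal(size=80)
z = rng.normal(size=80)
group = rng.integers(0, 4, size=80)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
# One scatter call per cluster so each gets its own legend entry.
for g in np.unique(group):
    m = group == g
    ax.scatter3D(x[m], y[m], z[m], label=f'group {g}')
ax.set_xlabel('crcrds')
ax.set_ylabel('xvisits')
ax.set_zlabel('crlmt')
ax.legend()
n_series = len(ax.collections)   # one collection per cluster plotted
```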
Use Elbow Plot and Analyse Clusters using BoxPlot:
Performed all steps and derived the required parameters & data to plot the ELBOW PLOT as exhibited in this section above. This helped determine the value for "K" as identified by the "Big" or "Sharp" Bend in the ELBOW PLOT above.
Performed all steps and derived the required parameters, data and the Cluster Group Labels to plot the BOX PLOT for all 3 dataset attributes grouped by each of the 4 Groups/Clusters. This is exhibited in this section above.
Cluster Analysis based on BoxPlots and other derivatives / techniques has been provided in the Markdown text cell in this section above.
My Housekeeping: Incremental CODE Backup 5: ⬇
### My Housekeeping: Incremental Jupyter File Code Backup 5 ⬆
Markdown("### Incremental Jupyter Notebook Code Backup 5 ⬆")
! cp "Project 5 USL KMC & HCluster ALL Life Bank.ipynb" \
"Project 5 USL KMC & HCluster ALL Life Bank 5.ipynb"
! ls -l Project*.ipynb
cdf_combo.head()
# Prepare data dfs for HC runs:
custDataAttr = cdf0.copy() # Restore Original saved df ("cdf0") in a new df for labeling with HC cluster label numbers
cdf_combo_kmGrp = cdf_combo.copy() # Save df labeled with Kmeans Cluster groups
# Remove col 'group' labeled by KMeans to retrieve our original working df, so that we can use the same processed df for HC:
cdf_combo.drop(['group'], axis = 1, inplace = True)
print('\n* custDataAttr df head rows: Original data restored / copied in this df for HC labelling:')
custDataAttr.head()
print('\n* cdf_combo df head rows: "Originalized" processed df: Previous kmeans label (col: "group") removed for HC runs / processing:')
cdf_combo.head()
# Executing HC Agglomerative for n_clusters = 4 (same as in kmeans above), data = cdf_combo (same processed data as in kmeans above)
# Tried running HC for n_clusters = 2, 3, 4, 5. Best data point distribution was with n_clusters = 4 (Code removed to reduce clutter)
# Hence will run HC for various values of 'linkage' (Proj Req.) with n_clusters = 4 (Clusters to find):
# Cluster results df containing data item counts per Cluster for various values of linkage: 'single', 'complete', 'average', 'ward'
HCresults = pd.DataFrame()
lnkTyp = ['single', 'complete', 'average', 'ward']
for lnk in lnkTyp:
print('HC for linkage = ', lnk)
HCmodel = AgglomerativeClustering(n_clusters=4, metric='euclidean', linkage=lnk) # 'affinity=' was renamed 'metric=' in newer sklearn
HCmodel.fit(cdf_combo)
custDataAttr['labels'] = HCmodel.labels_
HCresults[lnk] = custDataAttr.groupby(["labels"]).count()['Sl_No'] # use any df col to count() items in clusters for current linkage
HCresults
⬆ Among all 4 methods above, with the linkage='complete' method, data seems most evenly distributed within each of the 4 clusters/labels above.
⬇ Hence, let's use the 'complete' linkage method to summarize using mean() & to view summarized data values by each column within each cluster/label:
# The last iteration of HC was for 'complete' linkage, so the labels on the dataset are still in place.
# So lets summarize the Bank's Credit Card Customer Data from the existing df which is already labeled from HC 'complete' linkage method:
print('\n* Head df data with HC "labels" using "complete" linkage: ')
custDataAttr.head() # Show sample data with labels
print('\n* Within Cluster/Group summarized (grouped by HC "label") "mean()" values of Customer Attributes using "complete" linkage: ')
custDataAttr.iloc[:, 2:].groupby(["labels"]).mean() # Skip first 2 cols ('Sl_No', Customer Key) for summary using mean()
⬆ Credit Limit seems somewhat Directly Proportional to the Number of Credit Cards a Customer has.
⬆ A Customer's Number of Bank Contacts (visits + online + calls) seems somewhat Inversely Proportional to the Credit Limit.
# Plot Box plots for all 4 HC clusters:
cdf_combo.join(custDataAttr.labels).groupby(['labels']).boxplot(layout=(1,4), figsize=(20, 7))
plt.show();
⬆ The boxplots above reconfirm our earlier observations (just above) & the detailed Observations & Conclusions (stated earlier, much above in the kmeans section).
My Housekeeping: Incremental CODE Backup 6: ⬇
### My Housekeeping: Incremental Jupyter File Code Backup 6 ⬆
Markdown("### Incremental Jupyter Notebook Code Backup 6 ⬆")
! cp "Project 5 USL KMC & HCluster ALL Life Bank.ipynb" \
"Project 5 USL KMC & HCluster ALL Life Bank 6.ipynb"
! ls -l Project*.ipynb
⬇ Now we will do HC with various linkage methods using Dendrograms & Cophenetic Coefficients and Visualize the Results :
# Run HC Clustering, Plot Dendrogram, Compute Cophenetic Coeff & Silhouette Scores...
# ...for various types of linkages: 'single', 'complete', 'average', 'centroid', 'ward' :
max_d = 2 # Cut height passed to fcluster(..., criterion='distance') below
lnkTyp = ['single', 'complete', 'average', 'centroid', 'ward'] # linkage methods to be used
HC_CopSil_rslt = pd.DataFrame(['Cophenetic Coeff', 'Silhouette Score']) # Results df for Cop.Coeffs & Sil.Scores for various linkages:
# function to plot dendrogram, to be called from inside the loop:
def dendo(Z, lnk):
plt.figure(figsize=(25, 10))
plt.title(lnk + ' linkage Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('All Life Bank Credit Card Customer Data')
plt.ylabel('Distance')
dendrogram(Z)
# Loop to plot a dendrogram for each type of linkage in list 'lnkTyp':
for lnk in lnkTyp:
Z = linkage(cdf_combo, metric='euclidean', method=lnk)
c, coph_dists = cophenet(Z , pdist(cdf_combo))
clusters = fcluster(Z, max_d, criterion='distance')
# clusters
s = silhouette_score(cdf_combo, clusters)
HC_CopSil_rslt[lnk] = [c, s]
print('\n* HC for linkage =', lnk, '\n = Coph.Coeff:', round(c, 5), '\n = Silh.Score:', round(s, 5))
dendo(Z, lnk)
print('\n* HC Summary: Cophenetic Coeffs & Silhouette Scores for various HC linkage methods:')
HC_CopSil_rslt
⬆ From the above table: the linkage = "ward" method seems BEST for both Coph.Coeff & Silh.Score (values closer to "+1" form the best clusters).
⬇ We will proceed to further steps with the selected dendrogram, produced using the "ward" linkage method:
# Backup / Save the HC Z & HC Clusters just created by HC 'ward' linkage, in case we need them, so we don't have to run the process again:
# The last / most recent dendrogram was for "ward". So "Z" and "clusters" will have HC values from the "ward" method: Save them:
Z_HCward_save = Z
print('\n* HC: Final HC Z Values for linkage = "ward" :')
Z_HCward_save
clusters_HCward_save = clusters
print('\n* HC: Final HC Clusters for linkage = "ward" :')
clusters_HCward_save
⬇ Truncate the selected dendrogram ('ward') at various cluster levels and visualize, so that we can select the best Number and Type of Clusters that are formed:
# "Z" already have HC values from the desired / selected method "ward". So we will use "Z" directly here:
# Use truncate_mode='lastp' in dendrogram() to arrive at a final pruned workable dendrogram with required Clusters:
# Try selecting 6 Cluster ('p' value) and select the Clsuters that were merged last (truncate_mode='lastp'):
dendrogram(Z, truncate_mode='lastp', p=6); # Show only the last p merged clusters
# "Z" already have HC values from the desired / selected method "ward". So we will use "Z" directly here:
# Use truncate_mode='lastp' in dendrogram() to arrive at a final pruned workable dendrogram with required Clusters:
# Try selecting 6 Cluster ('p' value) and select the Clsuters that were merged last (truncate_mode='lastp'):
dendrogram(Z, truncate_mode='lastp', p=5); # Show only the last p merged clusters
# "Z" already have HC values from the desired / selected method "ward". So we will use "Z" directly here:
# Use truncate_mode='lastp' in dendrogram() to arrive at a final pruned workable dendrogram with required Clusters:
# Try selecting 6 Cluster ('p' value) and select the Clsuters that were merged last (truncate_mode='lastp'):
dendrogram(Z, truncate_mode='lastp', p=4); # Show only the last p merged clusters
# "Z" already have HC values from the desired / selected method "ward". So we will use "Z" directly here:
# Use truncate_mode='lastp' in dendrogram() to arrive at a final pruned workable dendrogram with required Clusters:
# Try selecting 6 Cluster ('p' value) and select the Clsuters that were merged last (truncate_mode='lastp'):
dendrogram(Z, truncate_mode='lastp', p=3); # Show only the last p merged clusters
⬆ The Dendrogram with p=5 truncation has the BEST Data Distribution among the Clusters, followed by p=4
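Instead of reading cluster membership off a truncated dendrogram, fcluster with criterion='maxclust' cuts the same kind of linkage matrix Z into exactly k flat clusters. A sketch on toy, well-separated data (not the notebook's cdf_combo):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(6)
# Toy 2-D data: four tight, well-separated blobs standing in for cdf_combo.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0, 3, 6, 9)])

Z = linkage(X, method='ward')
# criterion='maxclust' cuts the tree so that exactly 4 flat clusters remain,
# matching the p=4 truncated dendrogram above.
labels = fcluster(Z, t=4, criterion='maxclust')
print(np.bincount(labels)[1:])   # fcluster labels start at 1
```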
My Housekeeping: Incremental CODE Backup 7: ⬇
### My Housekeeping: Incremental Jupyter File Code Backup 7 ⬆
Markdown("### Incremental Jupyter Notebook Code Backup 7 ⬆")
! cp "Project 5 USL KMC & HCluster ALL Life Bank.ipynb" \
"Project 5 USL KMC & HCluster ALL Life Bank 7.ipynb"
! ls -l Project*.ipynb
# Calculate Silhouette Scores for KMeans Clusters and save results in a df:
kmc_silh_score_df = []
print('\n⬇ Silhouette Score for Kmeans Clusters by Number Of Clusters (n_clusters) :\n')
for n_clusters_kmc in range(2,15):
clusters_kmc = KMeans(n_clusters=n_clusters_kmc, n_init=5, random_state=6)
preds_kmc = clusters_kmc.fit_predict(cdf_combo)
centers_kmc = clusters_kmc.cluster_centers_
kmc_silh_score = silhouette_score(cdf_combo, preds_kmc)
kmc_silh_score_df.append([n_clusters_kmc, kmc_silh_score])
print(' * KMeans n_clusters = ', n_clusters_kmc, 'Silhouette Score: ', kmc_silh_score)
kmc_silh_score_df = pd.DataFrame(kmc_silh_score_df, columns=(['kmc_n_clusters', 'kmc_silh_score']))
kmc_silh_score_df.set_index('kmc_n_clusters', inplace=True)
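The loop above records only the averaged score per k; silhouette_samples gives per-point scores, so a weak cluster can be spotted even when the overall average looks healthy. A sketch on toy blobs (stand-ins for cdf_combo):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(6)
# Toy stand-in for cdf_combo: three separated blobs.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2)) for c in (0, 4, 8)])

km = KMeans(n_clusters=3, n_init=10, random_state=6).fit(X)
sample_scores = silhouette_samples(X, km.labels_)

# Per-cluster mean silhouette: a weak cluster shows up as a low mean here
# even when the overall average looks healthy.
for g in range(3):
    print(f'cluster {g}: mean silhouette = {sample_scores[km.labels_ == g].mean():.3f}')
overall = silhouette_score(X, km.labels_)   # equals the mean over all samples
```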
# Format & Display Silhouette Scores for KMeans Clusters & Hierarchial Clusters from their respective results df :
# Also: Compute & Display the AVERAGE mean() Silhouette Scores for All Clusters (KMC) & All Linkage Methods (HC):
pd.options.display.float_format = '{:,.6f}'.format
print('\n⬇ Silhouette Scores for KMeans by Number of Clusters: ( ⮕ Average Silh. Score For All Clusters:',
round(kmc_silh_score_df.mean()[0],6), '⬅ )' )
kmc_silh_score_df.T
print('\n⬇ Silhouette Scores for Hierarchical Clusters by Linkage Method: ( ⮕ Average Silh. Score For All Methods:',
round(HC_CopSil_rslt.T.iloc[1:, 1:].mean()[1],6), '⬅ )' )
# Display from previously calculated and saved HC Silh. Scores from this df: Filter out 1st df row for Coph. Coeff by using df[1:] ):
HC_CopSil_rslt[1:]
pd.options.display.float_format = '{:,.2f}'.format
Calculated Silhouette Scores as exhibited above for:
KMeans n_clusters = 2...14 : Results tabulated above
⮕ Also derived the AVERAGE Silh. Score = 0.338939, which is the mean() of all these kmeans silh. scores
Hierarchical linkage = (single, complete, average, centroid, ward)
⮕ Also derived the AVERAGE Silh. Score = 0.316871, which is the mean() of all these Hierarchical silh. scores
Conclusion: The averaged Silhouette Scores for KMeans & Hierarchical are not the SAME, but they are CLOSE & SIMILAR to each other
My Housekeeping: Incremental CODE Backup 8: ⬇
### My Housekeeping: Incremental Jupyter File Code Backup 8 ⬆
Markdown("### Incremental Jupyter Notebook Code Backup 8 ⬆")
! cp "Project 5 USL KMC & HCluster ALL Life Bank.ipynb" \
"Project 5 USL KMC & HCluster ALL Life Bank 8.ipynb"
! ls -l Project*.ipynb
# Apply HC Labels to cdf_combo (our working dataset df) from the already HC-labeled custDataAttr df:
# This creates an HC-labeled copy of cdf_combo for our further workings / study / analysis:
custDataAttr.head()
cdf_combo.head()
cdf_combo_HC_lbl = cdf_combo.join(custDataAttr.labels)
cdf_combo_HC_lbl.head()
# Display All 5 Features for each of the 4 Clusters formed by the KMeans & Hierarchical Methods:
# NOTE: The Customer ID columns 'Sl_No' & 'Customer Key' are omitted here as they are NOT relevant for this study:
print('\n⬇ Customer Features: mean() values by Clusters: Grouped by KMeans Cluster Numbers: "group" : ')
cdf_kmGrp_labeled.iloc[:, 2:].groupby(["group"]).mean() # Skip first 2 cols ('Sl_No', Customer Key) for summary using mean()
print('\n⬇ Customer Features: mean() values by Clusters: Grouped by Hierarchical Cluster Numbers: "labels" : ')
custDataAttr.iloc[:, 2:].groupby(["labels"]).mean() # Skip first 2 cols ('Sl_No', Customer Key) for summary using mean()
⬆ Compare each Feature within each Cluster WITH the same Feature IN both KMeans & Hierarchical Clusters ⬆
⬇ Tabulate the Findings:
| KMC:⬇ / HC:⮕ | H.Cluster 0 | H.Cluster 1 | H.Cluster 2 | H.Cluster 3 | Comments |
|---|---|---|---|---|---|
| KM.Cluster 0 | Similar ? | SIMILAR ✓ | | | HC 1 ≈ "Similar?" to KMC 0 & to its own HC 3 |
| KM.Cluster 1 | SIMILAR ✓ | | | | Cluster's Individual Feature values are similar |
| KM.Cluster 2 | SIMILAR ✓ | | | | Cluster's Individual Feature values are similar |
| KM.Cluster 3 | Distinct ✘ | Distinct ✘ | Distinct ✘ | Distinct ✘ | KMC 3 doesn't seem "Similar" to any H.Cluster |
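The similarity table above is judged by eye; a contingency table plus the adjusted Rand index quantifies how much the two labelings agree. A sketch with hypothetical label vectors standing in for the kmeans 'group' and HC 'labels' columns:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical label vectors standing in for the kmeans 'group' and HC 'labels' columns.
km_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3])
hc_labels = np.array([1, 1, 1, 0, 0, 0, 2, 2, 2, 3])

# Rows: KMeans clusters; columns: HC clusters. Off-diagonal mass = disagreement.
print(pd.crosstab(km_labels, hc_labels, rownames=['kmeans'], colnames=['hc']))

# ARI = 1 for identical partitions (up to relabeling), ~0 for random ones.
ari = adjusted_rand_score(km_labels, hc_labels)
print('adjusted Rand index:', round(ari, 3))
```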
My Housekeeping: Incremental CODE Backup 9: ⬇
### My Housekeeping: Incremental Jupyter File Code Backup 9 ⬆
Markdown("### Incremental Jupyter Notebook Code Backup 9 ⬆")
! cp "Project 5 USL KMC & HCluster ALL Life Bank.ipynb" \
"Project 5 USL KMC & HCluster ALL Life Bank 9.ipynb"
! ls -l Project*.ipynb
Key Questions & Answers:
⬇ Analyze & Visualize Clusters in KMC .OR. in HC to Differentiate them from one another:
⬇ Select KMC or HC Clusters based on their respective Silhouette Scores:
# Select KMC or HC Clusters based on their Silhouette Scores:
# Exhibit 1: Display All previously derived & saved Silhouette Scores For KMeans & Hierarchical Clusters:
pd.options.display.float_format = '{:,.6f}'.format
print('\n⬇ Silhouette Scores for KMeans by Number of Clusters:')
kmc_silh_score_df.T
print('\n⬇ Silhouette Scores for Hierarchical Clusters by Linkage Method:')
HC_CopSil_rslt[1:]
pd.options.display.float_format = '{:,.2f}'.format
⬆ From Exhibit 1 above, the Highest Silhouette Score for KMC is 0.48 (for 2 Clusters) and for HC it is 0.53 (for 4 Clusters).
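Reading the highest score off the table can also be done in code with idxmax(). A sketch on a hypothetical silhouette-by-k table shaped like kmc_silh_score_df above:

```python
import pandas as pd

# Hypothetical silhouette-by-k table in the shape of kmc_silh_score_df above.
scores = pd.DataFrame({'kmc_silh_score': [0.48, 0.41, 0.37, 0.33, 0.30]},
                      index=pd.Index([2, 3, 4, 5, 6], name='kmc_n_clusters'))

# idxmax() returns the index label (the k) of the highest silhouette score,
# i.e. the "Highest Silhouette Score" read off the exhibit above.
best_k = scores['kmc_silh_score'].idxmax()
best_score = scores['kmc_silh_score'].max()
print(best_k, best_score)
```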
# Visualize and Analyze all Clusters to Differentiate them from one another:
# Show the HC Clusters (finalized method) with all their Features for a side-by-side Differentiation
print('\n* Hierarchical Clusters labeled by index col "labels" : ')
cdf_combo_HC_lbl.groupby(["labels"]).mean().T
# Visual 1: Bar Plot for above grouped data for visual Cluster Differentiation Side by Side:
# Bar Plots for HC Clusters from above data df with the Cluster Features: Grouped by cluster number labels 'labels' :
print('⬇ Hierarchical Clusters: ("labels")') # fixed: the stray '*' unpacked the string into characters
cdf_combo_HC_lbl.groupby(["labels"]).mean().plot(kind="bar");
# Visual 2: For above grouped data for visual Cluster Differentiation Side by Side:
# Visual 2: Pair Plot for above grouped data for visual Cluster Differentiation Side by Side:
# Pair Plots for HC Clusters from above data df with the Cluster Features: Grouped by cluster number labels 'labels' :
print('⬇ Hierarchical Clusters: ("labels")') # fixed: the stray '*' unpacked the string into characters
sns.pairplot(cdf_combo_HC_lbl, hue='labels');
plt.show();
# Exhibit 2: Display HC Clusters with its Original Attributes:
print('\n⬇ Customer Features: mean() values by Clusters: Grouped by Hierarchical Cluster Numbers: "labels" : ')
custDataAttr.iloc[:, 2:].groupby(["labels"]).mean() # Skip first 2 cols ('Sl_No', Customer Key) for summary using mean()
⬆ From all the above Exhibits (1 & 2) and Visuals (1 & 2), here are the Answers to the Key Questions:
⬇ Differentiate Clusters from one another:
⬇ Key Questions & Answers: